feat(influx_tools): Add export to parquet files #25297

srebhan · 2024-09-09T11:56:09Z

Closes #
Supersedes #25253


I've read the contributing section of the project README.
Signed CLA (if not already signed).

This PR adds a command to export data into per-shard Parquet files. To do so, the command iterates over the shards, builds a cumulative schema over all series of a measurement (i.e. a superset of their tags and fields), and exports the data to one Parquet file per measurement and shard.
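The cumulative schema described above can be pictured as a union over the series of a measurement. The following is a minimal sketch under stated assumptions, not the code from this PR: the `series` and `schema` types, the `unionSchema` function and the decision to return an error on conflicting field types are all hypothetical and only serve to illustrate the idea.

```go
package main

import (
	"fmt"
	"sort"
)

// series is a hypothetical, simplified stand-in for one series of a
// measurement: its tag keys and its field keys with their InfluxDB types.
type series struct {
	tags   []string
	fields map[string]string // field name -> "float", "integer", "string", "boolean"
}

// schema is the cumulative (superset) schema of a measurement.
type schema struct {
	tags   []string          // sorted union of all tag keys
	fields map[string]string // union of all fields with their types
}

// unionSchema merges the tag and field keys of all series into one schema.
// It returns an error if the same field appears with two different types;
// this sketch does not try to resolve such conflicts.
func unionSchema(all []series) (schema, error) {
	tagSet := make(map[string]struct{})
	fields := make(map[string]string)
	for _, s := range all {
		for _, t := range s.tags {
			tagSet[t] = struct{}{}
		}
		for name, typ := range s.fields {
			if prev, ok := fields[name]; ok && prev != typ {
				return schema{}, fmt.Errorf("field %q has conflicting types %s and %s", name, prev, typ)
			}
			fields[name] = typ
		}
	}
	tags := make([]string, 0, len(tagSet))
	for t := range tagSet {
		tags = append(tags, t)
	}
	sort.Strings(tags)
	return schema{tags: tags, fields: fields}, nil
}

func main() {
	sch, err := unionSchema([]series{
		{tags: []string{"hostname", "region"}, fields: map[string]string{"usage_user": "float"}},
		{tags: []string{"hostname", "rack"}, fields: map[string]string{"usage_idle": "float"}},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sch.tags, len(sch.fields))
}
```

With a superset schema, a row from a series that lacks a given tag or field would presumably hold a null in that column.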

To test the tool, run:

go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet -config influxdb.conf -database telegraf

.circleci/config.yml

cmd/influx_tools/main.go

cmd/influx_tools/parquet/batcher.go

cmd/influx_tools/parquet/command.go

cmd/influx_tools/parquet/exporter.go

davidby-influx

I did a quick review, but I'm not familiar with arrow and certainly missed some things. I can do a more thorough review if we paired to walk through the algorithm once.

cmd/influx_tools/parquet/schema.go

srebhan · 2024-09-18T13:49:05Z

@davidby-influx thanks for the thorough review! I tried to address all issues and commented on the three unresolved ones. Will schedule a meeting for walking through the code. Thanks again!

cmd/influx_tools/parquet/batcher.go

davidby-influx

LGTM

cmd/influx_tools/parquet/cursors.go

alespour · 2024-10-02T11:13:19Z

~~I'm not sure what to make of this: I have v1 db with several measurements like cpu disk etc, each with ~8M rows~~

> select count(usage_user) from cpu
name: cpu
time count
---- -----
0    8631360

The same query returns a different number of rows in the exported Parquet "db":

alespour@master-node:/bigdata/x$ duckdb -column -s "select count(usage_user) from 'all/cpu-*.parquet'"
count(usage_user)
-----------------
28771200

Log attached.
cpu-export.log

alespour · 2024-10-02T14:06:08Z

tested measurement without tags - OK
tested single & all measurements export - OK, except the discrepancy of number of rows

Tested with a DB simulating one month of monitoring data from a small data center (9 measurements such as cpu and disk, 10 tags). DB file size on disk: 4.1 GB, 5 shards.

Exported Parquet size on disk: 11 GB. The export took 1h6m on a somewhat obsolete laptop (Core i7 CPU, 8-core, 16 GB RAM, SSD). Memory usage during export was stable (RSS peak ~2 GB).

InfluxDB measurement structure example:

> show tag keys from cpu
name: cpu
tagKey
------
arch
datacenter
hostname
os
rack
region
service
service_environment
service_version
team

> show field keys from cpu
name: cpu
fieldKey         fieldType
--------         ---------
usage_guest      float
usage_guest_nice float
usage_idle       float
usage_iowait     float
usage_irq        float
usage_nice       float
usage_softirq    float
usage_steal      float
usage_system     float
usage_user       float

Parquet:

alespour@master-node:/bigdata/x$ duckdb -column -s "describe select * from 'all/cpu-*.parquet'"

column_name          column_type  null  key  default  extra
-------------------  -----------  ----  ---  -------  -----
time                 TIMESTAMP    YES                      
arch                 VARCHAR      YES                      
datacenter           VARCHAR      YES                      
hostname             VARCHAR      YES                      
os                   VARCHAR      YES                      
rack                 VARCHAR      YES                      
region               VARCHAR      YES                      
service              VARCHAR      YES                      
service_environment  VARCHAR      YES                      
service_version      VARCHAR      YES                      
team                 VARCHAR      YES                      
usage_guest          DOUBLE       YES                      
usage_guest_nice     DOUBLE       YES                      
usage_idle           DOUBLE       YES                      
usage_iowait         DOUBLE       YES                      
usage_irq            DOUBLE       YES                      
usage_nice           DOUBLE       YES                      
usage_softirq        DOUBLE       YES                      
usage_steal          DOUBLE       YES                      
usage_system         DOUBLE       YES                      
usage_user           DOUBLE       YES

Measurement without tags:

alespour@master-node:/bigdata/x$ duckdb -column -s "select * from 'notags/*.parquet'"
time                        lat    lon  
--------------------------  -----  -----
2024-10-02 13:03:55.643371  49.95  14.47
2024-10-02 13:04:04.423014  49.91  14.49
2024-10-02 13:04:12.726653  49.94  14.53

alespour · 2024-10-02T14:19:02Z

I will repeat the test to verify the number of rows (mis)match.

alespour · 2024-10-02T18:09:54Z

My apologies, it was a mistake on my side. Row count matches.

InfluxDB:

> select count(usage_user) from cpu
name: cpu
time count
---- -----
0    28771200

Parquet:

alespour@master-node:/bigdata/x$ duckdb -column -s "select count(usage_user) from 'cpu/*.parquet'"
count(usage_user)
-----------------
28771200

alespour · 2024-10-03T08:08:49Z

tested other types - OK

Creating the following schemata for 1 measurement(s):
  Measurement "types" with 0 tag(s) and  5 field(s):
    Column	Kind		Datatype
    ------	----		--------
    time	timestamp	timestamp (nanosecond)
    label	field		string
    lat		field		float
    lon		field		float
    match	field		boolean
    scale	field		integer

alespour@master-node:/bigdata/x$ sudo duckdb -column -s "describe from 'types/*.parquet'"
column_name  column_type  null  key  default  extra
-----------  -----------  ----  ---  -------  -----
time         TIMESTAMP    YES                      
label        VARCHAR      YES                      
lat          DOUBLE       YES                      
lon          DOUBLE       YES                      
match        BOOLEAN      YES                      
scale        BIGINT       YES

alespour@master-node:/bigdata/x$ sudo duckdb -column -s "select * from 'types/*.parquet' limit 1"
time                        label  lat    lon    match  scale
--------------------------  -----  -----  -----  -----  -----
2024-10-03 07:58:33.419431  a1     49.94  14.53  true   4
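The schema log and the DuckDB `describe` output above imply a mapping from InfluxDB field types to the column types DuckDB reports. As a sketch only, that observed mapping can be written down as a lookup. The function name `duckdbType` and the error for unknown types are illustrative assumptions and are not taken from the PR; field types not covered by this test are deliberately left out.

```go
package main

import "fmt"

// duckdbType maps an InfluxDB v1 field type name to the column type that
// DuckDB reported for the exported files in the test above. The function
// name and the error case are illustrative and not part of the PR.
func duckdbType(influxType string) (string, error) {
	switch influxType {
	case "float":
		return "DOUBLE", nil
	case "integer":
		return "BIGINT", nil
	case "string":
		return "VARCHAR", nil
	case "boolean":
		return "BOOLEAN", nil
	default:
		// Types that were not part of the test are not guessed here.
		return "", fmt.Errorf("unsupported field type %q", influxType)
	}
}

func main() {
	for _, t := range []string{"float", "integer", "string", "boolean"} {
		col, err := duckdbType(t)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-8s -> %s\n", t, col)
	}
}
```

In the same outputs, tags show up as VARCHAR columns and `time` as a TIMESTAMP column.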

alespour · 2024-10-03T08:13:15Z

It's GTG by me 👍

srebhan · 2024-11-18T13:49:29Z

To run the exporter in this PR, do the following (assuming you are using a Bash-compatible shell):

Clone the repo and check out the PR

# git clone https://github.com/influxdata/influxdb.git
# cd influxdb/
# git fetch origin pull/25297/head:v1-bulk-exporter-parquet 
# git checkout v1-bulk-exporter-parquet

Build InfluxDB v1

# export PKG_CONFIG=${PWD}/pkg-config.sh
# go build ./...

Run the exporter (with the help flag)

# go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet --help

Run the exporter with the config of an existing server instance

# go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet -config <path to influxdb config dir>/influxdb.conf -database <database to export>
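The build step above depends on `PKG_CONFIG` being visible to the `go` child process. If the steps are driven from a program rather than typed into a shell, the variable has to be placed in the child's environment explicitly. The following Go sketch shows one way to do that with `os/exec`. The helper names `withEnv` and `buildCommand` are hypothetical, and the sketch assumes it is started from inside the checked-out repository.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// withEnv returns a copy of base in which key is set to value, replacing any
// existing entry for key.
func withEnv(base []string, key, value string) []string {
	prefix := key + "="
	out := make([]string, 0, len(base)+1)
	for _, kv := range base {
		if !strings.HasPrefix(kv, prefix) {
			out = append(out, kv)
		}
	}
	return append(out, prefix+value)
}

// buildCommand prepares "go build ./..." for the repository at repoDir with
// PKG_CONFIG pointing at the repository's pkg-config.sh, mirroring the shell
// steps above. The command is returned without being started.
func buildCommand(repoDir string) *exec.Cmd {
	cmd := exec.Command("go", "build", "./...")
	cmd.Dir = repoDir
	cmd.Env = withEnv(os.Environ(), "PKG_CONFIG", filepath.Join(repoDir, "pkg-config.sh"))
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	// Assumption: the program is run from the root of the checkout.
	repoDir, err := os.Getwd()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot determine working directory:", err)
		os.Exit(1)
	}
	if err := buildCommand(repoDir).Run(); err != nil {
		fmt.Fprintln(os.Stderr, "build failed:", err)
		os.Exit(1)
	}
}
```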

dburton-influxdata · 2024-11-18T22:12:13Z

Do we have a compiled version to test with or do I still need to clone the repo and build the go binary?

dburton-influxdata · 2024-11-18T22:18:45Z

I converted all of the Bash commands into a Python script and ran it. It generates an error during the build.
exporter_build_errors_python exporter_script.txt

dburton-influxdata · 2024-11-18T22:19:48Z

Here is the Python script in ZIP format for GitHub.
exporter_script.zip

srebhan · 2024-11-21T21:41:04Z

@dburton-influxdata using os.environ does NOT export the variable to subprocesses like the go command! You would need to use os.putenv but I don't understand why you need to use python for the whole thing...

jwei-influx · 2024-12-19T17:39:12Z

I'll be taking this over from Darren. I do have a couple of questions regarding this tool:

Has there been any consideration about how the tool is intended to handle mixed field type shards?
If we need to do any sort of custom partitioning on the eventual 3.0 system, are we able to do that with this tool? Or conversely, is the resulting parquet file from this tool able to be slotted in behind a custom partitioning scheme that is pre-applied to the 3.0 instance?
Are we able to do any sort of manipulation of the tags and fields using this tool, or potentially by editing the resulting parquet files?
Are we able to use this for backloading processes, i.e. slotting the resulting Parquet files into an existing database that is receiving the real-time dual-written feed from the original 1.x system?

I might have more questions as I test the tool, but these are the ones that are top of mind for me right now.

srebhan force-pushed the v1-bulk-exporter-parquet branch 2 times, most recently from 6869ba3 to bd44db9 Compare September 9, 2024 14:12

srebhan force-pushed the v1-bulk-exporter-parquet branch from 7c930bb to 2bb73ce Compare September 17, 2024 19:39

davidby-influx reviewed Sep 18, 2024

cmd/influx_tools/parquet/exporter.go (outdated, resolved)

davidby-influx reviewed Sep 18, 2024

cmd/influx_tools/parquet/schema.go (outdated, resolved)

cmd/influx_tools/parquet/schema.go (outdated, resolved)

feat(influx_tools): Add export to parquet files

46aef0b

srebhan force-pushed the v1-bulk-exporter-parquet branch from 2bb73ce to 46aef0b Compare September 18, 2024 10:41

srebhan added 9 commits September 18, 2024 12:45

chore: Wrap errors in influx_tools main

9ed1d01

chore: Do not create unused series cursor and simplify batcher creation

c12f293

chore: Move converter creation to batcher as it is only used there

a2367ee

fix: Caputure error when closing series cursor

41dacce

feat: Print shard series-file path on error

b7c9475

chore: Replace panic by returning an error

182195f

feat: Use logger instead of raw printing

795e581

fix: Caputure error when closing exporter

59b60e6

fix: Caputure more defer errors

390cf30

feat: Detect name conflicts after name resolution

3bfe17c

davidby-influx assigned srebhan Sep 18, 2024

davidby-influx reviewed Sep 18, 2024

cmd/influx_tools/parquet/batcher.go (outdated, resolved)

fix: Make sure deferred functions are actually called

76a88d1

srebhan force-pushed the v1-bulk-exporter-parquet branch from 23e7a05 to d7216ca Compare September 19, 2024 20:01

srebhan added 2 commits September 19, 2024 22:03

feat: Move out cursor handling

f2423af

feat: Preallocate maps and slices

a7d0f1b

srebhan force-pushed the v1-bulk-exporter-parquet branch from d7216ca to a7d0f1b Compare September 19, 2024 20:03

davidby-influx approved these changes Sep 19, 2024

cmd/influx_tools/parquet/cursors.go (resolved)

srebhan mentioned this pull request Oct 2, 2024

feat: influx_tools export parquet #25253

Closed

2 tasks
